{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Q-Learning in the GridWorld environment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Q-learning was an early RL breakthrough when it was developed by Chris Watkins for his [PhD thesis](http://www.cs.rhul.ac.uk/~chrisw/thesis.html) in 1989. It introduces incremental dynamic programming to control an MDP without knowing or modeling the transition and reward matrices that we used for value and policy iteration in the previous section. A convergence proof followed three years later by [Watkins and Dayan](http://www.gatsby.ucl.ac.uk/~dayan/papers/wd92.html).\n",
    "\n",
    "Q-learning directly optimizes the action-value function, q, to approximate q*. The learning proceeds off-policy, that is, the algorithm does not need to select actions based on the policy that's implied by the value function alone. However, convergence requires that all state-action pairs continue to be updated throughout the training process. A straightforward way to ensure this is by using an ε-greedy policy."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Q-learning algorithm keeps improving a state-action value function after random initialization for a given number of episodes. At each time step, it chooses an action based on an ε-greedy policy, and uses a learning rate, α, to update the value function, as follows:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$Q(S_t, A_t)\\leftarrow  Q(S_t, A_t) + \\alpha\\left[R_{t_1}+\\gamma \\max_a Q(S_{t+1},,a) − Q(S_t, A_t)\\right]$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the algorithm does not compute expected values because it does know the transition probabilities. It learns the Q function from the rewards produced by the ε-greedy policy and its current estimate of the value function for the next state.\n",
    "\n",
    "The use of the estimated value function to improve this estimate is called bootstrapping. The Q-learning algorithm is part of the TD learning algorithms. TD learning does not wait until receiving the final reward for an episode. Instead, it updates its estimates using the values of intermediate states that are closer to the final reward. In this case, the intermediate state is one time step ahead."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The notebook demonstrates how to build a Q-learning agent using the 3 x 4 grid of states from the Dynamic Programmimg [example](01_gridworld_dynamic_programming.ipynb)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports & Settings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "from pathlib import Path\n",
    "from time import process_time\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from mdptoolbox import mdp\n",
    "from itertools import product\n",
    "import gym"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set up Gridworld"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We first create our small gridworld as in the Dynamic Programmimg [example](01_gridworld_dynamic_programming.ipynb)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### States, Actions and Rewards"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "grid_size = (3, 4)\n",
    "blocked_cell = (1, 1)\n",
    "baseline_reward = -0.02\n",
    "absorbing_cells = {(0, 3): 1, (1, 3): -1}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "actions = ['L', 'U', 'R', 'D']\n",
    "num_actions = len(actions)\n",
    "probs = [.1, .8, .1, 0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "to_1d = lambda x: np.ravel_multi_index(x, grid_size)\n",
    "to_2d = lambda x: np.unravel_index(x, grid_size)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_states = np.product(grid_size)\n",
    "cells = list(np.ndindex(grid_size))\n",
    "states = list(range(len(cells)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "cell_state = dict(zip(cells, states))\n",
    "state_cell= dict(zip(states, cells))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "absorbing_states = {to_1d(s):r for s, r in absorbing_cells.items()}\n",
    "blocked_state = to_1d(blocked_cell)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "state_rewards = np.full(num_states, baseline_reward)\n",
    "state_rewards[blocked_state] = 0\n",
    "for state, reward in absorbing_states.items():\n",
    "    state_rewards[state] = reward"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "action_outcomes = {}\n",
    "for i, action in enumerate(actions):\n",
    "    probs_ = dict(zip([actions[j % 4] for j in range(i, num_actions + i)], probs))\n",
    "    action_outcomes[actions[(i + 1) % 4]] = probs_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'U': {'L': 0.1, 'U': 0.8, 'R': 0.1, 'D': 0},\n",
       " 'R': {'U': 0.1, 'R': 0.8, 'D': 0.1, 'L': 0},\n",
       " 'D': {'R': 0.1, 'D': 0.8, 'L': 0.1, 'U': 0},\n",
       " 'L': {'D': 0.1, 'L': 0.8, 'U': 0.1, 'R': 0}}"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "action_outcomes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Transition Matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_new_cell(state, move):\n",
    "    cell = to_2d(state)\n",
    "    if actions[move] == 'U':\n",
    "        return cell[0] - 1, cell[1]\n",
    "    elif actions[move] == 'D':\n",
    "        return cell[0] + 1, cell[1]\n",
    "    elif actions[move] == 'R':\n",
    "        return cell[0], cell[1] + 1\n",
    "    elif actions[move] == 'L':\n",
    "        return cell[0], cell[1] - 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-0.02, -0.02, -0.02,  1.  , -0.02,  0.  , -0.02, -1.  , -0.02,\n",
       "       -0.02, -0.02, -0.02])"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "state_rewards"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "def update_transitions_and_rewards(state, action, outcome):\n",
    "    if state in absorbing_states.keys() or state == blocked_state:\n",
    "        transitions[action, state, state] = 1\n",
    "    else:\n",
    "        new_cell = get_new_cell(state, outcome)\n",
    "        p = action_outcomes[actions[action]][actions[outcome]]\n",
    "        if new_cell not in cells or new_cell == blocked_cell:\n",
    "            transitions[action, state, state] += p\n",
    "            rewards[action, state, state] = baseline_reward\n",
    "        else:\n",
    "            new_state= to_1d(new_cell)\n",
    "            transitions[action, state, new_state] = p\n",
    "            rewards[action, state, new_state] = state_rewards[new_state]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "rewards = np.zeros(shape=(num_actions, num_states, num_states))\n",
    "transitions = np.zeros((num_actions, num_states, num_states))\n",
    "actions_ = list(range(num_actions))\n",
    "for action, outcome, state in product(actions_, actions_, states):\n",
    "    update_transitions_and_rewards(state, action, outcome)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((4, 12, 12), (4, 12, 12))"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rewards.shape, transitions.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Q-Learning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " We will train the agent for 2,500 episodes, use a learning rate of α= 0.1, and an ε=0.05 for the ε-greedy policy "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "max_episodes = 2500\n",
    "alpha = .1\n",
    "epsilon = .05\n",
    "gamma = .99"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we will randomly initialize the state-action value function as a NumPy array with the dimensions of [number of states and number of actions]:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "Q = np.random.rand(num_states, num_actions)\n",
    "skip_states = list(absorbing_states.keys())+[blocked_state]\n",
    "Q[skip_states] = 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The algorithm generates 2,500 episodes that start at a random location and proceed according to the ε-greedy policy until termination, updating the value function according to the Q-learning rule:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.5974638589999994"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "start = process_time()\n",
    "for episode in range(max_episodes):\n",
    "    state = np.random.choice([s for s in states if s not in skip_states])\n",
    "    while not state in absorbing_states.keys():\n",
    "        if np.random.rand() < epsilon:\n",
    "            action = np.random.choice(num_actions)\n",
    "        else:\n",
    "            action = np.argmax(Q[state])\n",
    "        next_state = np.random.choice(states, p=transitions[action, state])\n",
    "        reward = rewards[action, state, next_state]\n",
    "        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state])-Q[state, action])\n",
    "        state = next_state\n",
    "process_time() - start"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The episodes take 0.6 seconds and converge to a value function fairly close to the result of the value iteration example from the previous section."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>R</td>\n",
       "      <td>R</td>\n",
       "      <td>R</td>\n",
       "      <td>L</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>U</td>\n",
       "      <td>L</td>\n",
       "      <td>L</td>\n",
       "      <td>L</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>U</td>\n",
       "      <td>L</td>\n",
       "      <td>L</td>\n",
       "      <td>D</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   0  1  2  3\n",
       "0  R  R  R  L\n",
       "1  U  L  L  L\n",
       "2  U  L  L  D"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.DataFrame(np.argmax(Q, 1).reshape(grid_size)).replace(dict(enumerate(actions)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.858871</td>\n",
       "      <td>0.922653</td>\n",
       "      <td>0.986901</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.834113</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.717010</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.800174</td>\n",
       "      <td>0.763851</td>\n",
       "      <td>0.730269</td>\n",
       "      <td>0.571109</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          0         1         2         3\n",
       "0  0.858871  0.922653  0.986901  0.000000\n",
       "1  0.834113  0.000000  0.717010  0.000000\n",
       "2  0.800174  0.763851  0.730269  0.571109"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.DataFrame(np.max(Q, 1).reshape(grid_size))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## PyMDPToolbox"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Q Learning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Time: 10.5962'"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "start = process_time()\n",
    "ql = mdp.QLearning(transitions=transitions,\n",
    "                   reward=rewards,\n",
    "                   discount=gamma,\n",
    "                   n_iter=int(1e6))\n",
    "\n",
    "ql.run()\n",
    "f'Time: {process_time()-start:.4f}'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>R</td>\n",
       "      <td>R</td>\n",
       "      <td>R</td>\n",
       "      <td>L</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>U</td>\n",
       "      <td>L</td>\n",
       "      <td>U</td>\n",
       "      <td>L</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>R</td>\n",
       "      <td>R</td>\n",
       "      <td>U</td>\n",
       "      <td>L</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   0  1  2  3\n",
       "0  R  R  R  L\n",
       "1  U  L  U  L\n",
       "2  R  R  U  L"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "policy = np.asarray([actions[i] for i in ql.policy])\n",
    "pd.DataFrame(policy.reshape(grid_size))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.768113</td>\n",
       "      <td>0.918608</td>\n",
       "      <td>0.967810</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.499035</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.737283</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.226344</td>\n",
       "      <td>0.499552</td>\n",
       "      <td>0.616478</td>\n",
       "      <td>0.269537</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          0         1         2         3\n",
       "0  0.768113  0.918608  0.967810  0.000000\n",
       "1  0.499035  0.000000  0.737283  0.000000\n",
       "2  0.226344  0.499552  0.616478  0.269537"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "value = np.asarray(ql.V).reshape(grid_size)\n",
    "pd.DataFrame(value)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": true,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
